Add Mojo language support #502

Closed
Tokarzewski wants to merge 4 commits into DeusData:main from Tokarzewski:feat/mojo-language-support

Conversation

Tokarzewski commented Jun 18, 2026

What does this PR do?

Adds Mojo (Modular's Python-superset systems language) wiring — the enum, extraction spec, language entry, and tests — so the integration point is ready for the grammar to be re-vendored after provenance audit.

Provenance (grammar — not yet vendored in this PR):

  • Grammar: lsh/tree-sitter-mojo
  • Pinned commit: 33193a99afe6
  • License: MIT (forked from tree-sitter-python)
  • ABI version: 15 (compatible with current runtime ceiling)
  • Scanner type: C (no libstdc++ dependency)
  • Registries: Not listed in nvim-treesitter or Helix — community grammar
  • Regeneration: tree-sitter generate from the pinned commit (standard workflow)
  • Tracking issue: Add Mojo language support #737

Spec design

The grammar's node types mirror Python's, so the spec reuses the py_* arrays and overrides only the class types (same reuse pattern already used for CFScript→js_*). Mojo-specific divergences:

  • fn/def → function_definition
  • struct/class → class_definition; trait and __extension get their own nodes (trait_definition / extension_definition), so traits map to the Interface label
  • compile-time alias NAME = value has no dedicated grammar node — the upstream grammar recovers it as an assignment (name still captured)
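
The reuse pattern described in this section can be sketched roughly as follows. This is a minimal illustration, not the code in lang_specs.c: the `spec_sketch_t` layout, the `*_sketch` arrays, and `label_for` are hypothetical names invented here, and only the node-type strings and the Function/Class/Interface labels come from the description above. `extension_definition` is left out because the description does not say which label it receives.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified spec layout; the real struct in lang_specs.c may differ. */
typedef struct {
    const char *const *function_types;  /* NULL-terminated node-type lists */
    const char *const *class_types;
    const char *const *interface_types;
} spec_sketch_t;

/* Stand-in for one of the shared py_* arrays (assumed contents). */
static const char *const py_function_types_sketch[] = {"function_definition", NULL};

/* Mojo-only overrides: struct/class nodes and trait nodes. */
static const char *const mojo_class_types_sketch[] = {"class_definition", NULL};
static const char *const mojo_interface_types_sketch[] = {"trait_definition", NULL};

/* Reuse the Python function list and override only the class-like lists. */
const spec_sketch_t mojo_spec_sketch = {
    py_function_types_sketch,
    mojo_class_types_sketch,
    mojo_interface_types_sketch,
};

/* Returns 1 if s appears in the NULL-terminated list, else 0. */
static int list_contains(const char *const *list, const char *s) {
    for (; list != NULL && *list != NULL; ++list) {
        if (strcmp(*list, s) == 0) return 1;
    }
    return 0;
}

/* Maps a grammar node type to a graph label, or NULL when the node has
 * no dedicated label (for example the assignment that `alias` recovers to). */
const char *label_for(const spec_sketch_t *spec, const char *node_type) {
    if (list_contains(spec->function_types, node_type)) return "Function";
    if (list_contains(spec->interface_types, node_type)) return "Interface";
    if (list_contains(spec->class_types, node_type)) return "Class";
    return NULL;
}
```

With this shape, adding a Python-like language costs only the arrays that actually differ, which is presumably why the same approach was chosen for CFScript→js_*.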

What's in this PR (wiring only)

  • CBM_LANG_MOJO enum (appended, no renumbering of persisted DBs)
  • Extraction spec in lang_specs.c
  • .mojo and .🔥 extensions + LANG_NAMES in language.c
  • scripts/new-languages.json entry
  • Tests: test_grammar_regression.c case (fn→Function, struct→Class, trait→Interface) and the matching test_grammar_labels.c golden
  • MANIFEST.md entry (marked PENDING — vendored files removed pending audit)
  • THIRD_PARTY.md and README language count
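
As a rough sketch of the enum and extension wiring listed above (not the actual contents of language.c): the `lang_sketch_t` enum, its values, and `lang_from_filename` are hypothetical names used only for illustration. The point is that the Mojo value is appended after the existing ones so previously persisted numeric IDs keep their meaning, and that both `.mojo` and `.🔥` resolve to it. The emoji is written as its UTF-8 bytes so the example does not depend on the source-file encoding.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical language enum; the real one uses CBM_LANG_* names. */
typedef enum {
    LANG_SKETCH_UNKNOWN = 0,
    LANG_SKETCH_PYTHON,
    /* ... existing values keep their numbers ... */
    LANG_SKETCH_MOJO,   /* appended last so persisted IDs stay stable */
    LANG_SKETCH_COUNT
} lang_sketch_t;

typedef struct {
    const char *ext;
    lang_sketch_t lang;
} ext_entry_sketch_t;

/* ".\xF0\x9F\x94\xA5" is "." followed by U+1F525 (fire) encoded as UTF-8. */
static const ext_entry_sketch_t ext_table_sketch[] = {
    {".py", LANG_SKETCH_PYTHON},
    {".mojo", LANG_SKETCH_MOJO},
    {".\xF0\x9F\x94\xA5", LANG_SKETCH_MOJO},
};

/* Looks up a language from the text after the last '.' in the file name. */
lang_sketch_t lang_from_filename(const char *name) {
    const char *dot = strrchr(name, '.');
    size_t i;
    if (dot == NULL) return LANG_SKETCH_UNKNOWN;
    for (i = 0; i < sizeof ext_table_sketch / sizeof ext_table_sketch[0]; ++i) {
        if (strcmp(dot, ext_table_sketch[i].ext) == 0) {
            return ext_table_sketch[i].lang;
        }
    }
    return LANG_SKETCH_UNKNOWN;
}
```

The match is case-sensitive and exact in this sketch; whether the real lookup normalizes case is not stated in this PR.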

What's NOT in this PR (deferred)

  • Vendored grammar files (internal/cbm/vendored/grammars/mojo/) — removed from this PR per maintainer request. To build, copy the C parser + scanner from the pinned commit above into that directory and re-add grammar_mojo.c to the build.

Verification (original branch, before removal)

Indexed real Mojo corpora end-to-end:

  • NuMojo (pure Mojo, 135 files): functions, methods, structs, traits (Interface), decorators, calls, imports all extract; core NDArray struct correctly surfaces as the most-referenced type (in-degree 306).
  • EnergyPlusMojo + 7,354 machine-generated Mojo files: 166K+ nodes, no crashes — robust to malformed/partial generated Mojo.

@DeusData

Owner

Hey @Tokarzewski, could you please split this such that the grammar is not vendored through the PR and basically list which grammar should be checked and added through us? We would like to first audit vendored sources ourselves. So basically: Remove the vendored items, and let us know in the PR description that the dependency for this to work is the grammar XY which you fetched from the repo mentioned.

@Tokarzewski

Tokarzewski commented Jun 24, 2026

Author

Of course, apologies for the bloat. Will improve! @DeusData

DeusData added the enhancement, language-request, parsing/quality, and priority/backlog labels Jun 29, 2026
@DeusData

DeusData commented Jul 1, 2026

Owner

Thanks for adding Mojo support. Before this can move forward, we need the vendored grammar provenance tightened up: source repo, exact commit, generation command/version, license confirmation, and ideally a reproducible regeneration note or checksum. Please also link the tracking issue/language request and remove generated/session attribution from the commit message/PR body.

Mojo (Modular) is a Python-superset systems language. Wire the standard
language path — enum, extraction spec, language entry, test cases, and
registration — so the hook is ready for the grammar to be re-vendored
after provenance audit.

Tracking: DeusData#737
Grammar: lsh/tree-sitter-mojo @ 33193a99afe6, MIT, ABI 15, C scanner
         (community grammar — not in nvim-treesitter/Helix registries)

The grammar's node types mirror Python's, so the spec reuses the py_*
arrays and overrides only class types:
  - "struct"/"class"      → class_definition
  - "trait"/"__extension" → trait_definition / extension_definition (Interface)
  - "fn"/"def"            → function_definition
  - "alias NAME = value"  → assignment (no dedicated node in upstream grammar)

Signed-off-by: Tokarzewski <bartlomiej.tokarzewski@gmail.com>
Tokarzewski force-pushed the feat/mojo-language-support branch from 8dcb876 to 89071d2 July 1, 2026 14:02
Tokarzewski requested a review from DeusData as a code owner July 1, 2026 14:02
…support

Update branch to latest upstream/main (includes ObjectScript grammars,
git worktree support, repro framework, and other changes).
@Tokarzewski

Author

@DeusData can you help?

@DeusData

DeusData commented Jul 1, 2026

Owner

What exactly do u need? @Tokarzewski

…support

Resolved THIRD_PARTY.md grammar count (159→160) and README.md badge
to match upstream/main's ObjectScript additions.
Tokarzewski force-pushed the feat/mojo-language-support branch from 60af06e to 0aa1aed July 1, 2026 15:20
@Tokarzewski

Author

@DeusData I am unable to pass the CIs

@DeusData

DeusData commented Jul 1, 2026

Owner

Well, you need to sign your commits. That's something I cannot do for you. Otherwise the test and lint logs should be giving u hints on what you (or ur agent) should change

@DeusData

DeusData commented Jul 1, 2026

Owner

Thanks for asking, and sorry for the earlier short answer. I checked the logs more carefully and you were right to ask for help here.

This is not only a DCO problem. DCO still needs fixing on the branch update commit, but the larger CI blocker is that the PR wires Mojo into the compiled language/test path while the vendored Mojo grammar is absent, so the linker fails with undefined reference to tree_sitter_mojo.
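
To illustrate the failure mode: a tree-sitter generated parser exports a function of the form `const TSLanguage *tree_sitter_mojo(void)`, and any code that references it needs the vendored parser object at link time. The sketch below shows one hypothetical way a registry entry could be guarded so the reference only exists when the grammar is present. `SKETCH_HAVE_MOJO_GRAMMAR`, `grammar_entry_sketch_t`, and `grammar_available` are invented for this example and are not part of the project, which resolved the blocker by vendoring the grammar instead.

```c
#include <stddef.h>

/* Opaque language type, as exposed by tree-sitter's public API. */
typedef struct TSLanguage TSLanguage;

typedef const TSLanguage *(*grammar_fn_sketch_t)(void);

#ifdef SKETCH_HAVE_MOJO_GRAMMAR
/* Defined in the generated parser source. Referencing it without linking
 * that file is what yields "undefined reference to tree_sitter_mojo". */
extern const TSLanguage *tree_sitter_mojo(void);
#define MOJO_GRAMMAR_FN_SKETCH tree_sitter_mojo
#else
/* Grammar not vendored: leave the slot empty instead of referencing it. */
#define MOJO_GRAMMAR_FN_SKETCH NULL
#endif

/* Hypothetical registry entry pairing a language name with its grammar. */
typedef struct {
    const char *name;
    grammar_fn_sketch_t grammar;
} grammar_entry_sketch_t;

const grammar_entry_sketch_t mojo_entry_sketch = {"mojo", MOJO_GRAMMAR_FN_SKETCH};

/* Returns 1 when the entry has a linked grammar function, else 0. */
int grammar_available(const grammar_entry_sketch_t *entry) {
    return entry->grammar != NULL;
}
```

Compiled without the macro, the entry reports no grammar and the program still links; compiled with it, the build requires the vendored parser sources.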

That part is on the maintainer side. We have not yet had the time to audit and integrate the Mojo grammar, but we will do that now: pick the right upstream source, verify the license/provenance, run a security review on the generated/parser sources, and vendor it cleanly. Once that is in place, this PR should have a real path to green CI after rebasing/signing.

So please keep this PR open. For now, the action item on your side is only the DCO cleanup; the missing grammar integration is something we need to handle. Thanks for sticking with it.

@DeusData

DeusData commented Jul 1, 2026

Owner

Follow-up from maintainer side: I opened #744 to vendor the Mojo tree-sitter grammar after provenance/license/security review. Once that lands, this PR can rebase on top of it and should no longer carry the missing-grammar/linker blocker; the remaining contributor-side item here is the DCO cleanup.

@DeusData

DeusData commented Jul 1, 2026

Owner

Follow-up now that the maintainer-side grammar work is complete: #744 has been merged.

The Mojo grammar is now vendored in main from a pinned upstream commit after license/provenance review. The CI license/provenance gates passed, and the vendored parser sources were checked for the security concerns we care about here: no process execution, no outbound network behavior, no telemetry-style paths, and no sensitive-path/environment exfiltration logic found.

So the missing tree_sitter_mojo / linker blocker is now resolved on the project side. You should not need to carry the vendored grammar in this PR anymore; the next useful step is to rebase on current main, keep the Mojo integration changes, and clean up the DCO/sign-off issue on the remaining commits.

Thanks again for pushing this forward, and sorry again that we made you wait on a maintainer-side blocker here. Once the branch is rebased and signed, we can review the actual integration changes properly.

Tokarzewski closed this Jul 4, 2026